DisMo: A Morphosyntactic, Disfluency and Multi-Word Unit Annotator. An Evaluation on a Corpus of French Spontaneous and Read Speech

Authors

  • George Christodoulides
  • Mathieu Avanzi
  • Jean-Philippe Goldman
Abstract

We present DisMo, a multi-level annotator for spoken language corpora that integrates part-of-speech tagging with basic disfluency detection and annotation, and multi-word unit recognition. DisMo is a hybrid system that uses a combination of lexical resources, rules, and statistical models based on Conditional Random Fields (CRF). In this paper, we present the first public version of DisMo for French. The system is trained and its performance evaluated on a 57k-token corpus, including different varieties of French spoken in three countries (Belgium, France and Switzerland). DisMo supports a multi-level annotation scheme, in which the tokenisation to minimal word units is complemented with multi-word unit groupings (each having associated POS tags), as well as separate levels for annotating disfluencies and discourse phenomena. We present the system’s architecture, linguistic resources and its hierarchical tag-set. Results show that DisMo achieves a precision of 95% (finest tag-set) to 96.8% (coarse tag-set) in POS-tagging non-punctuated, sound-aligned transcriptions of spoken French, while also offering substantial possibilities for automated multi-level annotation.
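The multi-level scheme described in the abstract (minimal tokens, multi-word unit groupings that carry their own POS tag, and a separate disfluency level) can be pictured with a small data structure. The following is only an illustrative sketch: the class names, the tag labels (`PRO`, `VER`, `CON`, `REP`) and the helper `tagging_precision` are hypothetical and are not taken from DisMo or its actual tag-set.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    """A minimal word unit with its own POS tag (tag labels are hypothetical)."""
    text: str
    pos: str
    disfluency: Optional[str] = None  # separate annotation level, e.g. "REP"


@dataclass
class MultiWordUnit:
    """A grouping over tokens[start:end] that carries its own POS tag."""
    start: int
    end: int
    pos: str


@dataclass
class Utterance:
    tokens: List[Token]
    mwus: List[MultiWordUnit] = field(default_factory=list)

    def mwu_text(self, unit: MultiWordUnit) -> str:
        """Return the surface form of a multi-word unit."""
        return " ".join(t.text for t in self.tokens[unit.start:unit.end])


def tagging_precision(predicted: List[str], gold: List[str]) -> float:
    """Share of tokens whose predicted tag equals the reference tag."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("tag sequences must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical utterance: "je je pars parce que", with a repeated pronoun
# and the multi-word conjunction "parce que".
utt = Utterance(
    tokens=[
        Token("je", "PRO", disfluency="REP"),
        Token("je", "PRO"),
        Token("pars", "VER"),
        Token("parce", "CON"),
        Token("que", "CON"),
    ],
    mwus=[MultiWordUnit(start=3, end=5, pos="CON")],
)
```

Keeping the disfluency label and the multi-word grouping outside the token's POS tag mirrors the idea of separate annotation levels: each level can be evaluated or corrected without touching the others.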


Similar resources

Reconstructing False Start Errors in Spontaneous Speech Text

This paper presents a conditional random field-based approach for identifying speaker-produced disfluencies (i.e. if and where they occur) in spontaneous speech transcripts. We emphasize false start regions, which are often missed in current disfluency identification approaches as they lack lexical or structural similarity to the speech immediately following. We find that combining lexical, syn...


應用不定長度特徵之條件隨機域於口語不流暢語流修正 (Disfluency Correction of Spontaneous Speech using Conditional Random Fields with Variable Length Features) [In Chinese]

This paper presents an approach to detecting and correcting edit disfluency based on conditional random fields with variable-length features. The variable-length features consist of word, chunk and sentence features. Conditional random fields (CRF) are adopted to model the properties of the edit disfluency, including repair, repetition and restart, for edit disfluency detection. For the evaluat...


The IFCASL Corpus of French and German Non-native and Native Read Speech

The IFCASL corpus is a French-German bilingual phonetic learner corpus designed, recorded and annotated in a project on individualized feedback in computer-assisted spoken language learning. The motivation for setting up this corpus was that there is no phonetically annotated and segmented corpus for this language pair of comparable size and coverage. In contrast to most learner corpora, the...


Identification of High-Frequency Morphosyntactic Structures in Persian-Speaking Children Aged 4-6 Years: A Qualitative Research

Background: Syntax has a high importance among linguistic parameters and the prevalence of syntax deficits is relatively high in children with language disorders. As such, independent examination of syntax in language development is of paramount importance. In this regard, Iranian language pathologists are faced with the lack of standardized tests. The present study aimed to determine the most ...


How are word-final schwas different in the North and South of France?

The aim of this paper is twofold: (i) give a large-scale description in realized word-final schwas of French lexical words for different regions (North vs. South) and different speaking styles (read vs. spontaneous speech); (ii) highlight differences in prosodic features and test these differences via automatic classification techniques. The proposed study relies on a subset of 12.5 hours of th...



Publication date: 2014